Graphics in R

Basic Plots

Setup

Run the Setup.R file.

If everything works correctly, you should see a plot:

ggplot2 In a Nutshell

  • Package for statistical graphics
  • Developed by Hadley Wickham
  • Designed to adhere to good graphical practices
  • Supports a wide variety of plot types
  • Constructs plots using the concept of layers
  • See http://had.co.nz/ggplot2/ or Hadley’s book ggplot2: Elegant Graphics for Data Analysis for reference material

qplot Function

The qplot() function is the basic workhorse of ggplot2

  • Produces all plot types available with ggplot2
  • Allows for plotting options within the function statement
  • Creates an object that can be saved
  • Plot layers can be added to modify plot complexity

qplot Structure

The qplot() function has a basic syntax:

qplot(variables, plot type, dataset, options)

  • variables: list of variables used for the plot
  • plot type: specified with a geom = statement
  • dataset: specified with a data = statement
  • options: there are so, so many options!
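
As a rough illustration of how these four parts line up in a real call, the sketch below annotates a single qplot() call on the diamonds data that ships with ggplot2. The choice of variables and the alpha value are only for illustration. Recent ggplot2 releases mark qplot() as deprecated, so a deprecation warning may appear.

```r
library(ggplot2)

p <- qplot(
  carat, price,       # variables: x and y, looked up in the dataset
  geom = "point",     # plot type
  data = diamonds,    # dataset
  alpha = I(0.1)      # options: I() sets a fixed value instead of mapping it
)

# p is an ordinary ggplot object; typing p at the console draws the plot
class(p)
```

Because the result is stored in p, it can be printed later or extended with additional layers using +.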

Diamonds Data

Objective: Explore the diamonds data set (preloaded along with ggplot2) using qplot for basic plotting.

The data set was scraped from a diamond exchange company database. It contains the prices and attributes of over 50,000 diamonds.

Examining the Diamonds Data

What does the data look like?

Look at the top few rows of the diamond data frame to find out!

head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

Basic Scatterplot

Basic scatter plot of diamond price vs. carat weight

qplot(carat, price, geom = "point", data = diamonds)

Another Scatterplot

Scatter plot of diamond price vs. carat weight, showing the versatility of the options in qplot

qplot(carat, log(price), geom = "point", data = diamonds, 
    alpha = I(0.2), colour = color, 
    main = "Log price by carat weight, grouped by color") + 
    xlab("Carat Weight") + ylab("Log Price")

Your Turn

All of the “Your Turns” for this section will use the tips data set:

tips <- read.csv("https://bit.ly/2gGoiLR")

  1. Use qplot to build a scatterplot of the variables tip and total_bill
  2. Use options within qplot to color points by smoker status
  3. Clean up the axis labels and add a main plot title

Your Turn Solutions

Scatterplot of the variables tip and total_bill

qplot(data = tips, x = total_bill, y = tip)

Your Turn Solutions

Color points by smokers

qplot(data = tips, x = total_bill, y = tip, 
      color = smoker)

Your Turn Solutions

Pretty axis labels and title

qplot(data = tips, x = total_bill, y = tip, 
      color = smoker,
      xlab = "Total Bill ($)",
      ylab = "Tip ($)", 
      main = "Tip left by patrons' total bill and smoking status")

Plotting Map Data

States Data

To make a map, load up the states data and take a look:

states <- map_data("state")
head(states)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

Basic Map Data

What data is needed in order to plot a basic map?

  • Latitude/longitude points for all map boundaries
  • The boundary group to which each lat/long point belongs
  • The order to connect points within each group
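
To make the roles of group and order concrete, here is a minimal sketch on a made-up data frame of two squares (the squares data and its values are invented for illustration; only the column names mirror the states data). It assumes, as with geom = "path" in ggplot2, that points are connected in the order the rows appear.

```r
library(ggplot2)

# Two unit squares: `group` says which shape a point belongs to,
# `order` gives the sequence in which the corners are connected
squares <- data.frame(
  long  = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat   = c(0, 0, 1, 1,  0, 0, 1, 1),
  group = rep(c(1, 2), each = 4),
  order = rep(1:4, times = 2)
)

# Rows already sorted by order within group: two separate outlines
ok <- qplot(long, lat, geom = "path", data = squares, group = group)

# Same points with corners 2 and 3 of each square swapped: the paths zigzag
scrambled <- squares[c(1, 3, 2, 4, 5, 7, 6, 8), ]
bad <- qplot(long, lat, geom = "path", data = scrambled, group = group)

# Sorting by group and then order restores the intended sequence
restored <- scrambled[order(scrambled$group, scrambled$order), ]
```

Printing ok and bad side by side shows why the row order matters; dropping group = group would additionally join the two squares with a stray line.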

Basic Map Data

The states data has all necessary information

A Basic Map

A bunch of latitude longitude points…

qplot(long, lat, geom = "point", data = states)

A Bit Better Map

… that are connected with lines in a very specific order.

qplot(long, lat, geom = "path", data = states, group = group) + 
    coord_map()

Polygon vs Path

qplot(long, lat, geom = "polygon", data = states, group = group) + 
    coord_map()

Incorporating Information

  • Add other geographic information by adding geometric layers to the plot
  • Add non-geographic information by altering the fill color for each state
    • Use geom = "polygon" to treat states as solid shapes
    • Show numeric information with color shade/intensity
    • Show categorical information using color hue

Categorical Data

If a categorical variable is assigned as the fill color then qplot will assign different hues for each category.

Load in a state regions dataset:

statereg <- read.csv("https://bit.ly/2i0AFHK")
head(statereg)
##        State StateGroups
## 1 california        West
## 2     nevada        West
## 3     oregon        West
## 4 washington        West
## 5      idaho        West
## 6    montana        West

Joining Data

Join (merge) the original states data with the new information.

The left_join function is used for merging**:

library(dplyr)
states.class.map <- left_join(states, statereg, by = c("region" = "State"))
head(states.class.map)
##        long      lat group order  region subregion StateGroups
## 1 -87.46201 30.38968     1     1 alabama      <NA>       South
## 2 -87.48493 30.37249     1     2 alabama      <NA>       South
## 3 -87.52503 30.37249     1     3 alabama      <NA>       South
## 4 -87.53076 30.33239     1     4 alabama      <NA>       South
## 5 -87.57087 30.32665     1     5 alabama      <NA>       South
## 6 -87.58806 30.32665     1     6 alabama      <NA>       South

** More on this later
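
For a feel of what left_join does with by = c("region" = "State"), here is a minimal sketch on made-up data. The map_pts and lookup tables, and the values in them, are invented stand-ins for states and statereg, not the real files.

```r
library(dplyr)

# Toy stand-in for the map data: one row per boundary point
map_pts <- data.frame(
  region = c("alabama", "alabama", "iowa", "guam"),
  long   = c(-87.5, -87.6, -93.1, 144.8),
  stringsAsFactors = FALSE
)

# Toy stand-in for the lookup table: one row per state
lookup <- data.frame(
  State       = c("alabama", "iowa"),
  StateGroups = c("GroupA", "GroupB"),
  stringsAsFactors = FALSE
)

# The key columns have different names, so `by` pairs them up explicitly
joined <- left_join(map_pts, lookup, by = c("region" = "State"))

# Every row of map_pts is kept; keys with no match in lookup get NA
joined
```

Keys are matched as exact strings, so a lookup table with capitalized state names would need something like tolower() applied first to line up with the lowercase region column.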

Plotting the Result

qplot(long, lat, geom = "polygon", data = states.class.map, 
      group = group, fill = StateGroups, colour = I("black")) + 
    coord_map() 

Numerical Data & Maps

  • Behavioral Risk Factor Surveillance System
  • 2008 telephone survey run by the Centers for Disease Control and Prevention (CDC)
  • Asks a variety of questions related to health and wellness
  • Cleaned data with state aggregated values posted on website

BRFSS Data Aggregated by State

states.stats <- read.csv("https://bit.ly/2gT95Hc")

##   state.name   avg.wt avg.qlrest2   avg.ht  avg.bmi avg.drnk
## 1    alabama 180.7247    9.051282 168.0310 29.00222 2.333333
## 2     alaska 189.2756    8.380952 172.0992 28.90572 2.323529
## 3    arizona 169.6867    5.770492 168.2616 27.04900 2.406897
## 4   arkansas 177.3663    8.226619 168.7958 28.02310 2.312500
## 5 california 170.0464    6.847751 168.1314 27.23330 2.170000
## 6   colorado 167.1702    8.134715 169.6110 26.16552 1.970501

Join the data again

states.map <- left_join(states, states.stats, by = c("region" = "state.name"))
head(states.map)
##        long      lat group order  region subregion   avg.wt avg.qlrest2
## 1 -87.46201 30.38968     1     1 alabama      <NA> 180.7247    9.051282
## 2 -87.48493 30.37249     1     2 alabama      <NA> 180.7247    9.051282
## 3 -87.52503 30.37249     1     3 alabama      <NA> 180.7247    9.051282
## 4 -87.53076 30.33239     1     4 alabama      <NA> 180.7247    9.051282
## 5 -87.57087 30.32665     1     5 alabama      <NA> 180.7247    9.051282
## 6 -87.58806 30.32665     1     6 alabama      <NA> 180.7247    9.051282
##    avg.ht  avg.bmi avg.drnk
## 1 168.031 29.00222 2.333333
## 2 168.031 29.00222 2.333333
## 3 168.031 29.00222 2.333333
## 4 168.031 29.00222 2.333333
## 5 168.031 29.00222 2.333333
## 6 168.031 29.00222 2.333333

Shade and Intensity

Average number of days of insufficient sleep in the last 30 days

qplot(long, lat, geom = "polygon", data = states.map, 
      group = group, fill = avg.qlrest2) + coord_map()

BRFSS Data by Gender and State

states.sex.stats <- read.csv("https://srvanderplas.github.io/NPPD-Analytics-Workshop/02.Graphics/data/states.sex.stats.csv")
# Short-link alternative given in the slides:
# states.sex.stats <- read.csv("https://bit.ly/2hiKFIb")
head(states.sex.stats)
##   state.name SEX   avg.wt avg.qlrest2   avg.ht  avg.bmi avg.drnk    sex
## 1    alabama   1 198.8936    8.648936 177.5729 28.50714 3.033333   Male
## 2    alabama   2 173.0315    9.224771 163.9956 29.21280 2.041667 Female
## 3     alaska   1 203.3919    7.236111 178.3896 28.91494 2.487179   Male
## 4     alaska   2 169.5660    9.907407 163.1296 28.89286 2.103448 Female
## 5    arizona   1 191.3739    5.163793 177.1724 27.63152 2.814286   Male
## 6    arizona   2 156.2054    6.142857 162.7043 26.67683 2.026667 Female

One More Join

states.sex.map <- left_join(states, states.sex.stats, by = c("region" = "state.name"))
head(states.sex.map)
##        long      lat group order  region subregion SEX   avg.wt
## 1 -87.46201 30.38968     1     1 alabama      <NA>   1 198.8936
## 2 -87.46201 30.38968     1     1 alabama      <NA>   2 173.0315
## 3 -87.48493 30.37249     1     2 alabama      <NA>   1 198.8936
## 4 -87.48493 30.37249     1     2 alabama      <NA>   2 173.0315
## 5 -87.52503 30.37249     1     3 alabama      <NA>   1 198.8936
## 6 -87.52503 30.37249     1     3 alabama      <NA>   2 173.0315
##   avg.qlrest2   avg.ht  avg.bmi avg.drnk    sex
## 1    8.648936 177.5729 28.50714 3.033333   Male
## 2    9.224771 163.9956 29.21280 2.041667 Female
## 3    8.648936 177.5729 28.50714 3.033333   Male
## 4    9.224771 163.9956 29.21280 2.041667 Female
## 5    8.648936 177.5729 28.50714 3.033333   Male
## 6    9.224771 163.9956 29.21280 2.041667 Female
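
The output above shows every boundary point twice, once per sex. A minimal sketch on made-up data (the boundary and stats tables and their values are invented for illustration) shows the same row doubling. Newer dplyr releases may warn about a many-to-many relationship here, which is expected for this kind of join.

```r
library(dplyr)

# Toy boundary with three points for one state
boundary <- data.frame(
  region = rep("alabama", 3),
  order  = 1:3,
  stringsAsFactors = FALSE
)

# Toy stats table with two rows per state, one for each sex
stats <- data.frame(
  state.name = c("alabama", "alabama"),
  sex        = c("Male", "Female"),
  avg.drnk   = c(3.0, 2.0),
  stringsAsFactors = FALSE
)

both <- left_join(boundary, stats, by = c("region" = "state.name"))

nrow(boundary)  # 3
nrow(both)      # 6: each boundary point appears once per sex

# Keeping one sex recovers a single copy of the outline
males <- both[both$sex == "Male", ]
nrow(males)     # 3
```

This is why a map drawn from such a join needs to be split by sex, for example with facets, so that each panel draws only one copy of each state.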

Adding Information

Average # of alcoholic drinks per day by state and gender

qplot(long, lat, geom = "polygon", data = states.sex.map, 
      group = group, fill = avg.drnk) + coord_map() + 
    facet_grid(sex ~ .)

Your Turn

  • Use left_join to combine child healthcare data with maps information.
    You can load in the child healthcare data with:
states.health.stats <- read.csv("https://bit.ly/2hRBMq0")
  • Use qplot to create a map of child healthcare undercoverage rate by state

Your Turn Solutions

library(maps)
library(dplyr)
states <- map_data("state")
states.health.map <- left_join(states, states.health.stats, 
                               by = c("region" = "state.name"))

# Use qplot to create a map of child healthcare undercoverage 
# rate by state
    
qplot(data = states.health.map, x = long, y = lat, 
      geom = 'polygon', group = group, 
      fill = no.coverage) + coord_map()

Cleaning Up Your Maps

Use ggplot2 options to clean up your map!

  • Add a title: + ggtitle(...)
  • Use a plain white background: + theme_bw()
  • Drop the latitude and longitude axes when the geography is very familiar: + theme(...)
  • Customize the color gradient: + scale_fill_gradient2(...)
  • Keep aspect ratios correct: + coord_map()

Cleaned Up Map

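qplot(long, lat, geom = "polygon", data = states.map, 
      group = group, fill = avg.drnk) + 
  coord_map() +  theme_bw() +
  # scale_fill_gradient2() diverges around midpoint = 0 by default, so with
  # all-positive values only the white-to-high half of the scale is used
  scale_fill_gradient2(
    name = "Avg Drinks",
    limits = c(1.5, 3.5), 
    low = "lightgray", high = "red") + 
  theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        axis.title = element_blank()) +
  # Use \n for the line break so the second title line is not indented
  ggtitle("Average Number of Alcoholic Beverages\nConsumed Per Day by State")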

Your Turn

Use options to polish the look of your map of child healthcare undercoverage rate by state!

Your Turn Solutions

qplot(data = states.health.map, x = long, y = lat, 
      geom = 'polygon', group = group, fill = no.coverage) + 
  coord_map() + 
  scale_fill_gradient2(
    name = "Child\nHealthcare\nUndercoverage",
    limits = c(0, .2), 
    low = 'white', high = 'red') + 
  ggtitle("Health Insurance in the U.S.\nWhich states have the highest rates\nof undercovered children?") +
  theme_minimal() + 
  theme(panel.grid = element_blank(), 
        axis.text = element_blank(),
        axis.title = element_blank())   
